Supervised Learning

Ensemble Learning and Decision Trees


Dr Richard Diamond
May 2022

In this session...

Ensemble models

Ensemble models are machine learning models that use more than one predictor to arrive at a prediction. A group of predictors forms an ensemble. In general, ensemble models outperform a single predictor (model), with improved efficiency and accuracy.

Ensemble learning can be classified as

  • Bagging ensemble learning, also referred to as parallel ensemble
  • Boosting ensemble learning, also referred to as sequential ensemble

Rather than being used as standalone classifier models, decision trees are most commonly utilised to construct bagging and boosting pipelines -- in the bagging case, random forests. Therefore, you will often see use cases of Gradient Boosting/XGBoost and AdaBoost. Extreme Gradient Boosting will be covered in a separate Python Lab (XGBoost on the collection of features around S&P500 price).

Arguably, decision trees have a higher likelihood of accentuating different features (compared to other ML/Supervised Learning techniques). Therefore, different bootstrapped samples will produce very different decision trees -- the regression model analogy is increasing a coefficient or reducing it to near zero.

Decision trees are also the default choice for the alternative model -- we will see this when creating the Bagging of Logistic Regressor pipeline. Trees produce a richer set of outcomes. Consensus of independent weak learners: random forests can improve on bagging if the correlation between the sampled trees is reduced.
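The decorrelation argument can be made precise with a standard variance identity (a textbook result, stated here for intuition): for $n$ identically distributed trees, each with variance $\sigma^2$ and pairwise correlation $\rho$, the variance of their average is

\begin{equation*} \mathrm{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} f_i(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{n}\,\sigma^2 \end{equation*}

Averaging over more trees kills the second term, so what remains is controlled by reducing $\rho$ -- which is exactly what a random forest's random feature subsets are designed to do.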

It is also possible to create ensembles of neural nets, for which Convolutional Neural Networks are utilised; these represent state-of-the-art models and use better-than-averaging ensembling.

Further arguments for ensemble models (Lee et al. 2015).

  • Bayesian Model Averaging -- ensembles are a finite sample approximation to integration over the class of models/type of estimators.
  • Model Combination -- ensembles enrich the space of hypotheses considered by the base model class and are representationally richer.

  • Estimation and Optimization Error Reduction -- ensemble averaging reduces the variance of base models, averaging out variations due to objective function's non-convexity and stochastic learning.

Logistic Regressor to Bagging (Part 1)

The most common model to predict the probability of default is Logistic Regression (also known as logit). There is a well-developed Maximum Likelihood Estimation theory and mathematics supporting work with the binary dependent variable: the {0,1} set of survival/default outcomes. We will review the logistic model's MLE in a separate Tutorial (Asset price direction prediction with simplified classifier methods).

There is also a special function -- logistic sigmoid -- that converts regression output into a probability figure.

Imbalanced Samples and Bagging Regressor (Part 2)

When Logistic Regression alone achieves poor prediction of the Default class label {1}, we invoke the standard trick of a data scientist -- bagging of a regressor. This is a meta-estimator that reduces the variance of a black-box estimator (eg, a decision tree can be elaborate, and it can be hard to explain why particular splits were chosen). Randomisation is intentionally introduced into the construction procedure (either a random subset, or randomly breaking the model!), resulting in an ensemble of models.

A bagging regressor fits base regressors, each on a random subset of the original data, and then aggregates the individual predictions (by voting/simple averaging) to form the single final prediction.

Boosted Decision Trees (Part 3)

Tree-based models are a class of ML/supervised learning models that are used for both classification and regression. In essence, the tree is a series of binary questions. The output of a trained model resembles a tree with branches, nodes and leaves (final nodes) -- each node contains a combination of class labels but is preferred to be 'pure', or entropy-minimising.

Overfitting is not very difficult to achieve with a decision tree classifier, especially when it is utilised as a single tree. A single tree gives explanatory benefits when visualised, while visual synthesis across a Random Forest is out of reach.

Decision trees are very adaptive to the dataset: not many branches are required to zero in on a MinNumInLeaf of 1-2 observations. We will see in our case how, in the Corporate_PD dataset with 72 default instances out of 4,000 annual observations, one decision tree quickly isolated those instances such that 100% accuracy was achieved. Without separating the training error from the out-of-sample (test) error, we can't even begin to project how badly this tree will behave out-of-sample.
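A minimal sketch of this overfitting trap, on a synthetic imbalanced sample (~2% noise 'defaults', not the Corporate_PD data): an unconstrained tree scores 100% on the half it was grown on, which says nothing about out-of-sample behaviour.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced sample: ~2% 'defaults' that are pure noise,
# so any perfect fit is necessarily memorisation.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.02).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # no depth limit

print(tree.score(X_tr, y_tr))   # 1.0 -- the tree memorises the training half
```

The training score alone is uninformative; only `tree.score(X_te, y_te)` begins to tell us how the tree generalises.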

AdaBoost.M1 was the first successful boosting algorithm developed for binary classification. AdaBoost is a good starting point to study the simple scheme of boosting and the related maths. The large 'however' is in the strategy of AdaBoost-type algorithms to progressively focus on the observations that yield the largest prediction errors. Focusing on better prediction for the outliers might turn out not to be the best strategy.
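The re-weighting scheme can be illustrated with one AdaBoost.M1 round on a toy sample (illustrative variable names, not the sklearn API): misclassified observations gain weight, which is precisely the 'focus on largest errors' strategy discussed above.

```python
import numpy as np

# One AdaBoost.M1 round on four observations, labels in {-1, +1}.
y = np.array([1, 1, -1, -1])            # true labels
pred = np.array([1, -1, -1, -1])        # a weak learner misclassifies point 1
w = np.full(4, 0.25)                    # start with uniform observation weights

err = w[pred != y].sum()                # weighted error rate (here 0.25)
alpha = 0.5 * np.log((1 - err) / err)   # this learner's vote in the ensemble

w = w * np.exp(-alpha * y * pred)       # misclassified points gain weight
w = w / w.sum()                         # renormalise to a distribution
print(w)                                # the misclassified point now carries weight 0.5
```

After one round, half of the total weight sits on the single misclassified observation -- the next weak learner is forced to concentrate on it.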

Material to review and study

CQF delegates can refer to Ensemble Methods lecture (quick introduction of decision tree principles and AdaBoost maths), then Section 3 of DT Theory and Stock Selection - T Guida.ipynb. After that, move on to Section 4 of that notebook as well as Python Lab on Extreme Gradient Boosting.

Sections 1.3 - 1.4 of DT Theory and Stock Selection - T Guida.ipynb discuss how you should analyse and prune an individual tree. Discussion of pruning criteria overlaps with hyperparameter tuning, which we make explicit through exploring parameter space in this notebook.


In [ ]:
 
In [180]:
import warnings
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
plt.style.use('fivethirtyeight')

from IPython.display import Image
In [90]:
# Preprocessing & Cross validation
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, RandomizedSearchCV, TimeSeriesSplit, cross_val_score, GridSearchCV

# Metrics
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

Loading data from Excel files

pandas can read Excel files, as can the fancier xlrd and the newer openpyxl Python packages.

  • We are going to use read_excel from pandas, input parameters described here
  • openpyxl operates with Excel Workbook in memory and can write back into file

Access to databases with credit default information is tricky. We read from Corporate_PD.xlsm, an older but nonetheless indicative implementation of Estimating Default Probability that starts from the Altman Z-score model and extends estimation to a full Generalised Linear Model (logistic regression).

In [2]:
xl = pd.ExcelFile('data/Corporate_PD.xlsm') 
xl.sheet_names
Out[2]:
['Summary', 'Logit', 'Inference', 'Restricted Model', 'Probit']
In [3]:
AltmanScores_names = [['WC/TA', 'Working Capital/TA'], ['RE/TA', 'Retained Earnings/TA'], ['EBIT/TA', 'EBIT/TA'], ['ME/TL', 'Market Cap/Total Liabilities'], ['S/TA', 'Sales/TA'], ]
#also ['Const','Intercept'] # scores_altman[0][0]

# usecols stopped working after pandas 0.20.0
# this has been fixed but NOT IN pandas 0.23.0 install
# https://github.com/pandas-dev/pandas/issues/18273
# https://github.com/jacksonjos/pandas/commit/e25710065c8b9291a35289bf3ef97cc62bf966cf

#Y_Response = pd.read_excel(xl, sheet_name="Logit", header=0, usecols=['Default'])
#X_Features = pd.read_excel(xl, sheet_name="Logit", header=0, usecols=['WC/TA', 'RE/TA', 'EBIT/TA', 'ME/TL', 'S/TA'])
#['Const', WC/TA', 'RE/TA', 'EBIT/TA', 'ME/TL', 'S/TA']

X_Features = pd.read_excel(xl, sheet_name="Logit", usecols=[4,5,6,7,8])
X_Features.head()
Out[3]:
WC/TA RE/TA EBIT/TA ME/TL S/TA
0 0.500799 0.306846 0.043373 0.956271 0.334774
1 0.547780 0.322214 0.051843 1.064545 0.334591
2 0.451001 0.225150 0.026813 0.804096 0.245585
3 0.306887 0.191936 0.030058 0.387010 0.253438
4 0.447246 0.217368 0.032458 0.791639 0.275531
In [4]:
X_Features.info() #len(X_Features)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   WC/TA    4000 non-null   float64
 1   RE/TA    4000 non-null   float64
 2   EBIT/TA  4000 non-null   float64
 3   ME/TL    4000 non-null   float64
 4   S/TA     4000 non-null   float64
dtypes: float64(5)
memory usage: 156.4 KB

Converting Y_Response

The ravel() function converts our Y_Response from the pandas DataFrame type to a numpy array.

Some classifiers 'do not like' a dataframe as the input for the single-column dependent variable Y.

There is more: sometimes classifiers do not like a numpy array as input either, and we need to reshape it into a 2D matrix.

In [5]:
Y_Response = pd.read_excel(xl, sheet_name='Logit', usecols=[2])
Y_Response.head()
Out[5]:
Default
0 0
1 0
2 0
3 0
4 0
In [6]:
Y_Response = Y_Response['Default'].ravel()
print(Y_Response)
[0 0 0 ... 1 1 1]
In [7]:
Y_Response_2D = np.reshape(Y_Response, (-1, 1)) # For the fussy OLS computation in sklearn IMPORTANT FEATURE
print(Y_Response_2D)

# Reshaping an array from 1D plain list (above) to 2D is IMPORTANT functionality. Similar to vec2mat() in Matlab
#x = np.reshape(x, (len(x),-1))
[[0]
 [0]
 [0]
 ...
 [1]
 [1]
 [1]]

Part I. Classification with Logistic Model

Most classifiers are suitable for both binary and multinomial classification: Logistic Regression, SVM, Naive Bayes, Random Forest.

Here we will use a small credit default dataset with certain financial ratios (Altman Z-scores): Corporate_PD.xlsm.

CLASS LABELS -- these are our {0,1} set of survival/default outcomes.

Classification is done for our Default indicator column (Y_Response).

This binary classification has the most developed set of accuracy measures (which are not always available for multinomial classification/deep neural networks).

FEATURES -- these are our financial ratios WC/TA, RE/TA, and others.

Effective visualisation starts with scatterplots and hyperplanes for two chosen features (2D visualisation): plotted one versus another. Functionality such as radviz visualises tension plots between three features on a 2D plane and can be informative.


For our corporate defaults, logistic regression computed in Excel by explicit MLE (based on the log of logit function) has the following exact result:

\begin{equation*} \text{Default} = + 0.4146 \,\text{WC/TA} -1.454 \,\text{RE/TA} -8.00 \,\text{EBIT/TA} -1.5936 \,\text{ME/TL} + 0.6198 \,\text{S/TA}+\varepsilon_t \end{equation*}



If you need the full logit inference with standard errors and p-values, it is best to use the statsmodels package in Python.

Or compute the Inverse Information $\mathbf{I}^{-1}$ by exact Maximum Likelihood Estimation explicitly! This is implemented in the Corporate_PD.xls file (Excel) with some VBA and use of Solver.

The regression model above is a GLM (Generalised Linear Model) descended from the traditional Altman Z-score model; the related probit regression uses the inverse of the Normal CDF as its nonlinear link.
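The fitted equation converts to a probability of default via the logistic sigmoid. A minimal sketch, using the MLE coefficients quoted above; the input ratios x are hypothetical, not a row from Corporate_PD.xlsm:

```python
import numpy as np

# beta ordered as WC/TA, RE/TA, EBIT/TA, ME/TL, S/TA (the coefficients above);
# the intercept is close to logit.intercept_ reported later in this notebook.
beta = np.array([0.4146, -1.454, -8.00, -1.5936, 0.6198])
intercept = -2.5434
x = np.array([0.50, 0.31, 0.04, 0.96, 0.33])   # hypothetical financial ratios

z = intercept + beta @ x                # linear score beta'X
pd_default = 1.0 / (1.0 + np.exp(-z))   # logistic sigmoid -> probability of default
print(round(pd_default, 4))
```

For a healthy set of ratios like these, the model returns a PD of roughly 1%, consistent with the low default rate in the sample.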



TAKE AWAYS FROM VARIOUS CLASSIFIERS RUN ON CORPORATE DEFAULT DATA

  • The most significant feature is Retained Earnings / Total Assets (RE/TA). While the dataset contains plenty of negative retained earnings (annual loss), the default signal given by the fitted sigmoid is too late: negative Retained Earnings (loss) has to be very large before the model classifies an obvious case of default.

  • Chosen for its negative feedback, the Market Equity (Cap)/Total Liabilities (ME/TL) feature is an example of an exogenous variable: large cap is associated with survival.

  • Confusion matrices help to look into prediction accuracy within class. As soon as we separate the dataset into train/test halves, Logistic Regression MISCLASSIFIED 39 out of 41 defaults. We also noticed that our Corporate PD.xls data is imbalanced: only 2% of companies defaulted.

  • Cross-validation proves useless: regardless of how we draw the sample, the mistake in predicting defaults barely affects the total accuracy score, which remains about 98%.

  • To tackle the issue, we created a machine learning 'pipeline' where we fed the Logistic Classifier object into a Bagging Regressor -- a method that fits the model (often a decision tree) on multiple sub-samples. Each competing model makes a prediction, and these predictions are averaged. That allowed us to predict up to half of the defaults correctly.

  • Visualisation of the Decision Tree regressor revealed how it starts building the tree by splitting on the feature ME/TL. Clustering with Radviz 'spring tension' plots confirmed that small cap companies tend to have a higher rate of default.

  • It is important to use the Decision Tree Regressor because the Decision Tree Classifier creates a very elaborate tree that predicts the data points on which it was trained close to perfectly.

  • The open question remains whether the fitted Decision Tree will be as effective on OUT-OF-SAMPLE data. Are we any closer to predicting defaults, more specifically, for the large cap companies that are in trouble?

In [8]:
from sklearn import linear_model

# sklearn Logistic Regression as the classifier.  Adding column with intercept value -- not required, already in Excel data.
logit = linear_model.LogisticRegression(C=1e5)
logit.fit(X_Features, Y_Response)  ###FITTING DONE HERE
Out[8]:
LogisticRegression(C=100000.0, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
In [12]:
print()
print('Logit beta coefficients for WC/TA, RE/TA, EBIT/TA, ME/TL, S/TA:')
print(logit.coef_)
print()
print('Intercept:')
print(logit.intercept_)
print()
print('model.classes_ =', logit.classes_)
Logit beta coefficients for WC/TA, RE/TA, EBIT/TA, ME/TL, S/TA:
[[ 0.41456713 -1.45401525 -8.00363047 -1.59359637  0.6198409 ]]

Intercept:
[-2.54337613]

model.classes_ = [0 1]

Sigmoid Plotting

We design our own routine to plot the logistic sigmoid to our specifications.

The input is the viable range of values for the feature, which for RE/TA is $[-7, 5]$.

  • np.linspace(X_min, X_max, 3000) generates 3000 values in that range incrementally -- those are our input values (x-axis).

  • logistic_sigmoid is computed with the use of the relevant beta value (for the RE/TA feature) from the logistic regression,

logit.coef_[0,FeatureBetaIdx] + logit.intercept_

  • OLS naive regression line is added using the same beta coefficient (for comparison).

  • data scatterplots are added at the start, using a different colour for the survival/default class labels {0,1}. The data is simply scattered around the levels of zero and one (horizontally).

ax.scatter(X_Features[FeatureName], Y_Response, c=Y_Response, zorder=20)

In [30]:
def logistic_sigmoid(xb):
    return (1 / (1 + np.exp(-xb)))
In [31]:
#AltmanScores_names = [['WC/TA', 'Working Capital/TA'], ['RE/TA', 'Retained Earnings/TA'], ['EBIT/TA', 'EBIT/TA'], ['ME/TL', 'Market Cap/Total Liabilities'], ['S/TA', 'Sales/TA'], ]


#Procedure RELIES X_Features, Y_Response variables to be existing
def logistic_plot(X_min, X_max, FeatureName, FeatureBetaIdx):  
    
    plt.clf() #clears the figure drawing space, nothing to do with classifier!
    fig, ax = plt.subplots(figsize=(18,10))  #fig = plt.figure(figsize=(18,10))
        
    # 1. Plot two clusters of observations at Y={0,1} on a scatter
    ax.scatter(X_Features[FeatureName], Y_Response, c=Y_Response, zorder=20) #X_Features[FeatureName].ravel()
    
    # 2. Plot CALIBRATED sigmoid function (in orange) -- with the coeffient from Logistic Regression
    X_Sim = np.linspace(X_min, X_max, 3000) #Fill in 3,000 values for the range of axe X
    Y_Loss = logistic_sigmoid(X_Sim * logit.coef_[0,FeatureBetaIdx] + logit.intercept_) # Y_Loss = logistic_sigmoid(X_Sim * logit.coef_[0,FeatureBetaIdx] + logit.intercept_).ravel() 
    ax.plot(X_Sim, Y_Loss, color='red', linewidth=3)
    
    # 3. Below plots OLS line
    ols = linear_model.LinearRegression()
    ols.fit(X_Features, Y_Response_2D) ### Fitting done here. Fussy OLS computation requires np.reshape(Y_Response, (-1, 1))
    
    ax.plot(X_Sim, ols.coef_[0,FeatureBetaIdx] * X_Sim + ols.intercept_, linewidth=1) #we plot beta*x + intercept (omitting other betas)
    ax.axhline(.5, color='.5')
    
    plt.ylabel('Default Indicator 1 or 0', fontsize=22) # also ax.set_ylabel('Default Indicator')
    plt.xlabel('wrt Feature: ' + AltmanScores_names[FeatureBetaIdx][1], fontsize=22)
    plt.xticks(range(X_min, X_max), fontsize=14) #Axe X range
    plt.yticks([0, 1], fontsize=14)
    plt.ylim(-.25, 1.25)
    plt.xlim(X_min, X_max) #Axe X range
    plt.legend(('Logistic Regression', 'Linear Regression'),
           loc="lower right", fontsize=14)
    #plt.show()
    return ax

# plt.subplots() is a function that returns a tuple containing a figure and axes object(s). 
# Thus when using fig, ax = plt.subplots() you unpack this tuple into the variables fig and ax. 
#type(fig) #<class 'matplotlib.figure.Figure'>
#type(ax) #<class 'matplotlib.axes._subplots.AxesSubplot'>
In [109]:
logistic_plot(-7, 5, 'RE/TA', 1)
Out[109]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1bad1c18>
<Figure size 432x288 with 0 Axes>

FEATURE RE/TA (above): we explore this dimension, which has THE MOST SIGNIFICANT (t-statistic) logit model coefficient.

The Retained Earnings (Loss) ratio has both positive and negative values, which is informative: most defaulting companies have a negative ratio (in any case RE/TA < 1). That is about where the sigmoid goes to 0 ('fires' no default signal). Unfortunately, its default signal is too late (for RE/TA < -3). This classifier needs more data in order to learn to recognise the default.

The economic implication is that in this dataset, the lack of cash was not the main cause of default (the next feature, Working Capital, is not significant in the logit at all).

For this most significant logit feature (regression variable), the linear model makes some sense because it separates the non-default cases that do not overlap with defaults in the dimension of this ratio, at about $RE/TA > 0.6$.

Linear regression (OLS) 'tries to mimic' the non-linear sigmoid here -- in the right direction, but far from being a good classifier for the task.


In [11]:
logistic_plot(-7, 7, 'WC/TA', 0)
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x110375dd8>
<Figure size 432x288 with 0 Axes>

FEATURE WC/TA (above): this dimension is for an INSIGNIFICANT logit model coefficient. The positive sign of the coefficient gives us an inverted sigmoid, stretched over a very long range -- this is what insignificance means.

We still see that the classifier (sigmoid) is not calibrated to differentiate between default and non-default.


In [12]:
logistic_plot(-3, 5, 'ME/TL', 3)
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x1103758d0>
<Figure size 432x288 with 0 Axes>

FEATURE ME/TL (above) for the LEAST significant regression coefficient. This is typical of an exogenous variable (vs endogenous). Companies with very large market capitalisation default less on their obligations. Defaults are clustered in the small-cap range, but there is no good separation from non-defaults.

Logistic vs OLS: the linear regression model (OLS) is presented for comparison, but in this case it is clearly 'confused' into a horizontal line. It does not cross the cluster of observations, but this is an artefact of separating 0s from 1s.


Supervised Learning with Logistic Regression

WHAT MAKES THE DIFFERENCE?

Softmax regression is the term for the multinomial case of logistic regression.

  • The default setting for binary classification is multi_class='ovr'. That, however, can be set to multi_class='multinomial'.

  • Multinomial prediction will inevitably be based on the softmax function: it takes in the multidimensional covariate of features X and maps to multiple outputs. A probability prediction is made for each class. That would be applicable to credit rating situations or gradations of companies in a near-default state.

  • Parameter C is the inverse of the regularisation strength. The parameter is similar to the one in Support Vector Machines (described in further sections). Smaller values specify stronger regularisation. Here, the type of regularisation is the L2 penalty.

logit = linear_model.LogisticRegression(C=1e5)

L2 regularisation creates a Ridge regression (linear, logit). $C=10^5$ is very large; since C is the inverse of the regularisation strength, we apply almost no regularisation, and therefore the $\beta$ values are essentially unchanged compared to the Maximum Likelihood Estimation of the logit.

  • The Corporate PD.xls Excel file shows the insides of canonical GLM (logit, probit) estimation by explicit Maximum Likelihood: Solver is run to vary beta to maximise the total log-likelihood (computed explicitly for the regression).

  • As an alternative to MLE, the Logistic Cost Function can be solved by gradient descent. That approach allows modifying the penalty term, which is typically (and below) the sum of squares of the regression coefficients.

  • That is called L2 regularization; the penalty works against regression coefficients being too large, eg, 8 vs 1.2.

\begin{equation} SSE = \sum_{i}^{N_{sample}} \left( \text{target}_i - \text{output}_i \right)^2 + \text{L2} \end{equation}

where, for all regression coefficients $w$ (instead of $\beta$), $\qquad \displaystyle \text{L2}: \lambda \, \|w\|_2^2 = \lambda \, \sum_{j}^{N_{coeff}} w_j^2 $.

The larger the value of $\lambda$, the stronger the regularization of the model (weights approach zero).
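The shrinkage effect can be demonstrated directly by fitting the same logit at two settings of C (recall C is the inverse of $\lambda$). A minimal sketch on synthetic data, not the Corporate_PD sample:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification data with a known linear signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

weak = LogisticRegression(C=1e5).fit(X, y)     # almost unregularised, near-MLE betas
strong = LogisticRegression(C=1e-2).fit(X, y)  # heavy L2 penalty shrinks the betas

print(np.abs(weak.coef_).sum(), np.abs(strong.coef_).sum())
```

The coefficient magnitudes under C=1e-2 come out markedly smaller than under C=1e5, which is the weights-approach-zero behaviour described above.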

In [41]:
#Image('../CODE_ML/image/Logistic_L2Regularisation.png', width=500)

Without regularization, the objective is to find the global cost minimum (min SSE). By adding a regularization penalty, the objective becomes minimisation under a budget constraint, where the budget space for the $w_1, w_2$ parameters is given by the grey circle.



Transition Probabilities

Transition probabilities are possible to extract for each type of classifier (eg, Logistic, SVM, even Decision Tree). Here, for the Logistic Classifier, $p$ is computed by the logistic function on $\beta' X$ and represents the probability of default -- class label 1, the second column.

The probability of survival is $1-p$, and given in the first column.

The output gives way to a probability transition matrix, but not between states as with Markov chains -- just per observation. If we group observations into, say, credit ratings, then it would be possible to further compute a proper transition matrix.
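The grouping step might look as follows -- a sketch with illustrative PD values and hypothetical rating-band cut-offs, not output from this notebook's model:

```python
import numpy as np
import pandas as pd

# Bucket per-observation PDs into hypothetical rating bands and average
# within each band -- a first step towards a proper transition matrix.
pd_hat = np.array([0.01, 0.02, 0.08, 0.12, 0.30, 0.55])   # illustrative PDs
bands = pd.cut(pd_hat, bins=[0.0, 0.05, 0.25, 1.0], labels=['A', 'B', 'C'])

avg_pd = pd.Series(pd_hat).groupby(bands, observed=True).mean()
print(avg_pd)
```

With real data one would also track how companies move between bands from one year to the next, which yields the rating transition matrix proper.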

In [13]:
# Remember we fitted the Logistic Model (no need to specify an intercept additionaly)
#logit = linear_model.LogisticRegression(C=1e5)
#logit.fit(X_Features, Y_Response)

#clf.predict(X_Features)
logit.predict_proba(X_Features)
Out[13]:
array([[0.98840442, 0.01159558],
       [0.99088592, 0.00911408],
       [0.98249419, 0.01750581],
       ...,
       [0.94488136, 0.05511864],
       [0.89797991, 0.10202009],
       [0.93257805, 0.06742195]])

Classification Report and Confusion Matrix

In [185]:
y_pred = logit.predict(X_Features)
In [183]:
print(classification_report(Y_Response, y_pred))
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      3928
           1       0.43      0.04      0.08        72

    accuracy                           0.98      4000
   macro avg       0.71      0.52      0.53      4000
weighted avg       0.97      0.98      0.97      4000

In [184]:
# Confusion Matrix
cm = confusion_matrix(Y_Response, y_pred)
temp = pd.DataFrame(cm, index=['0', '1'], columns=['0', '1'])

sns.heatmap(temp, annot=True, cmap='Blues', fmt='g')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

Part II. Bagging of Logistic Regressor

Alternative model is Decision Trees

There exist Python packages to address imbalanced samples. However, the popular data scientist choice is to tackle the problems (of the imbalanced sample and unsatisfactory prediction) by improving and complicating the estimation method -- creating a Bagging Classifier pipeline.


Definition: a bagging regressor is an ensemble meta-estimator that fits base regressors, each on a random subset of the original data, and then aggregates the individual predictions (by voting/averaging) to form the single final prediction.

This is a meta-estimator that reduces the variance of a black-box estimator (eg, a decision tree can be elaborate, and it can be hard to explain why particular splits were chosen). Randomisation is intentionally introduced into the construction procedure (either a random subset, or randomly breaking the model!), resulting in an ensemble of models.

Further advancement to bagging/boosting (Ensemble methods)

Description of a BAGGING ALGORITHM: take samples from the training data, then construct a model (usually a decision tree) for each data sample. When you need to make a prediction for new data, each model makes a prediction and the predictions are averaged to give a better estimate of the true output value (this is like the softmax function, but averaging is done over models that differ structurally).
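The loop just described can be written out from scratch (the notebook itself uses sklearn's BaggingRegressor; this sketch on synthetic data only exposes the mechanics):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data with a linear signal in the first feature.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=200)

preds = []
for _ in range(10):                                # n_estimators = 10
    idx = rng.integers(0, len(X), size=len(X))     # bootstrap sample, with replacement
    tree = DecisionTreeRegressor().fit(X[idx], y[idx])
    preds.append(tree.predict(X))

y_hat = np.mean(preds, axis=0)                     # simple averaging of the ensemble
```

Each bootstrap sample omits roughly a third of the observations, so the ten trees disagree; the average smooths out their individual overfitting.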

Here the number of observations on which the Bagging Regressor operates is obviously above 4,000 (Ncompanies in our dataset) -- notice the density of dots.

The default parameter was n_estimators = 10, so 10 alternative Decision Tree models were produced; each will have different companies predicted as defaulted (REF).

In [14]:
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import BaggingClassifier
In [15]:
# Common application over KNeighborsClassifier()
# from sklearn.neighbors import KNeighborsClassifier

#bagging_classKNN = BaggingClassifier(KNeighborsClassifier(),
#                          max_samples=0.5, max_features=0.5)

#bagging_classSVN = BaggingClassifier(SVC(C=C, probability=True),
#                          max_samples=0.5, max_features=0.5)
In [16]:
#We are going to use Logistic Classifier. Below were already executed so 'logit' variable exists
#logit = linear_model.LogisticRegression(C=1e5)
#logit.fit(X_Features, Y_Response)

bagging_reg_logit = BaggingRegressor(logit)
bagging_reg_logit.fit(X_Features, Y_Response)
In [19]:
bagging_pred_logit = bagging_reg_logit.predict(X_Features) #this is just an array with predictions
bagging_pred_logit
Out[19]:
array([0. , 0. , 0. , ..., 0. , 0.1, 0. ])
In [33]:
print()
print('Returns the coefficient of determination R^2 of the prediction:')
print(bagging_reg_logit.score(X_Features, Y_Response, sample_weight=None))
Returns the coefficient of determination R^2 of the prediction:
0.0007637474541747347
NOTE. The R^2 score is designed for exact prediction, but our output is a likelihood.

This is a very poor R^2, but see further how we utilise and interpret the output of bagging.

In [26]:
#print()
#print('The collection of fitted sub-estimators')
#print(bagging_reg_logit.estimators_)

#Returns a dynamically generated list of indices identifying the samples used for fitting each member of the ensemble, i.e., the in-bag samples.
#Note: the list is re-created at each call to the property in order to reduce the object memory footprint by not storing the sampling data.
    
print('The subset of drawn samples for each base estimator (Logistic Classifier)')
print(bagging_reg_logit.estimators_samples_)

print()
print(bagging_reg_logit.estimators_features_)

#print('Score of the training dataset obtained using an out-of-bag estimate')
#print(bagging_reg_logit.oob_score_) #NOT IN THIS VERSION
The subset of drawn samples for each base estimator (Logistic Classifier)
[array([False, False, False, ...,  True,  True, False]), array([ True,  True, False, ...,  True,  True, False]), array([ True, False,  True, ...,  True, False, False]), array([ True, False,  True, ..., False, False, False]), array([False,  True,  True, ..., False,  True, False]), array([ True, False, False, ..., False,  True,  True]), array([False,  True, False, ...,  True, False,  True]), array([ True,  True, False, ...,  True,  True,  True]), array([ True,  True,  True, ..., False,  True,  True]), array([False, False, False, ...,  True,  True, False])]

[array([0, 1, 2, 3, 4]), array([0, 1, 2, 3, 4]), array([0, 1, 2, 3, 4]), array([0, 1, 2, 3, 4]), array([0, 1, 2, 3, 4]), array([0, 1, 2, 3, 4]), array([0, 1, 2, 3, 4]), array([0, 1, 2, 3, 4]), array([0, 1, 2, 3, 4]), array([0, 1, 2, 3, 4])]

Bagging Regressor over Logistic Classifier (plot)

In [1]:
#Quick scatter to check
#plt.scatter(X_Features['RE/TA'], bagging_pred_logit, c=bagging_pred_logit, zorder=20)
In [32]:
#Full scatter plot with sigmoid

plt.figure(figsize=(18,10))
plt.scatter(X_Features['RE/TA'], bagging_pred_logit, c=bagging_pred_logit, zorder=20) #X_Features[FeatureName].ravel()
plt.colorbar()
plt.ylabel('Default Likelihood: Bagged Logistic Classifier', fontsize=20) # also ax.set_ylabel('Default Indicator')
plt.xlabel('wrt Feature: ' + 'RE/TA', fontsize=20)

X_Sim = np.linspace(-4, 2, 1500) #Fill in 1,500 values for the x-axis range
Y_Loss = logistic_sigmoid(X_Sim * logit.coef_[0,1] + logit.intercept_) # Y_Loss = logistic_sigmoid(X_Sim * logit.coef_[0,FeatureBetaIdx] + logit.intercept_).ravel() 
plt.plot(X_Sim, Y_Loss, color='red', linewidth=1)
plt.show()

INTERPRETATION

The default number of estimators (alternative logistic regressions) is 10 (n_estimators : int, optional (default=10)). On the plot above: all ten models predicted the same two companies (top dots) to default; nine models predicted the same one company to default.

We filtered out those three very clear defaults, for which the Retained Earnings/Total Assets ratio is $<-1.8$.

It is also agreeable that for $\text{RE/TA} > 0.3$, there are no defaults.

Conclusion: depending on how the threshold $\text{L} > 0.2$ is set from the gradient scale on the right, 12 or more defaults can be predicted, compared to 2. That is a x6 increase in accuracy, compared to the non-workable k-fold cross-validation.

Exact results WILL DIFFER ON EACH RUN.
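The thresholding step itself is one line: a bagged prediction is the fraction of base estimators 'voting' default, so flagging observations above a cut-off recovers more defaults than the raw 0/1 logit labels. A sketch with an illustrative array, not the notebook's bagging_pred_logit:

```python
import numpy as np

# Each value is the fraction of the 10 base estimators predicting default.
bagging_pred = np.array([0.0, 0.1, 0.3, 0.9, 1.0, 0.05])
threshold = 0.2                       # the cut-off L discussed above

flagged = bagging_pred > threshold    # boolean default flags
print(int(flagged.sum()))             # three observations exceed the threshold
```

Lowering the threshold trades precision for recall: more true defaults are caught, at the cost of more false alarms.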


In [40]:
bagging_reg = BaggingRegressor()
bagging_reg.fit(X_Features, Y_Response)
bagging_pred = bagging_reg.predict(X_Features)

plt.figure(figsize=(18,10))
plt.scatter(X_Features['RE/TA'], bagging_pred, c=bagging_pred, zorder=20) #X_Features[FeatureName].ravel()
plt.ylabel('Default Likelihood: Bagged Alternative Classifier', fontsize=20) # also ax.set_ylabel('Default Indicator')
plt.xlabel('wrt Feature: ' + 'RE/TA', fontsize=20)
plt.show()

On the plot ABOVE we see default predictions from the random forest (10 decision trees) -- yellow points.

  • Those 10 decision trees are perhaps more different than necessary, but each is accurate in its unique way. Combining their predictions results in a better estimate of the true value.

  • A clever bit of bagging: decision trees are created so that, rather than selecting optimal split points, suboptimal splits are made by introducing randomness. That should improve out-of-sample predictions. (Otherwise, given a large enough tree, it will fit the sample close to perfectly.)

  • TAKE AWAY: if you get good classification or clustering results with an algorithm that has inherently high variance (eg, Decision Tree), you can often get better results by bagging that algorithm.

Below is the continuation of our bagging of the algorithm, now with the default base estimator being a Decision Tree. This is the more common combination (Bagging + Random Forest).

In [41]:
#AltmanScores_names = [['WC/TA', 'Working Capital/TA'], ['RE/TA', 'Retained Earnings/TA'], ['EBIT/TA', 'EBIT/TA'], ['ME/TL', 'Market Cap/Total Liabilities'], ['S/TA', 'Sales/TA'], ]
#
#Procedure not independent, relies on X_Features, Y_Response


def bagging_scatterX2(FeatureName, FeatureBetaIdx, comp_estimator):  
    
    plt.clf()  # to clear the canvas
    fig = plt.figure(figsize=(18,10))

    # 1) Bagging estimator using Logit as base
    bagging_reg_logit = BaggingRegressor(logit)
    bagging_reg_logit.fit(X_Features, Y_Response)
    bagging_pred_logit = bagging_reg_logit.predict(X_Features)
    
    plt.subplot(221)
    plt.scatter(X_Features[FeatureName], bagging_pred_logit, c=bagging_pred_logit, zorder=20)
    plt.colorbar()
    plt.ylabel('Bagged Logistic Classifier', fontsize=12) 
    plt.xlabel('wrt Feature: ' + FeatureName, fontsize=12)

    # Add CALIBRATED sigmoid function (in orange) -- with the coefficient from Logistic Regression
    X_Sim = np.linspace(-4, 2, 1500) #Fill in 1,500 values over the range of the axis
    Y_Loss = logistic_sigmoid(X_Sim * logit.coef_[0,FeatureBetaIdx] + logit.intercept_) # Y_Loss = logistic_sigmoid(X_Sim * logit.coef_[0,FeatureBetaIdx] + logit.intercept_).ravel() 
    plt.plot(X_Sim, Y_Loss, color='red', linewidth=1)    

    
    # 2) Bagging estimator using AN ALTERNATIVE as base, no input = decision tree
    bagging_reg = BaggingRegressor(comp_estimator)
    bagging_reg.fit(X_Features, Y_Response)
    bagging_pred = bagging_reg.predict(X_Features)

    plt.subplot(222)
    plt.scatter(X_Features[FeatureName], bagging_pred, c=bagging_pred, zorder=20)
    plt.colorbar()
    plt.ylabel('Bagged ALTERNATIVE Classifier', fontsize=12)
    plt.xlabel('wrt Feature: ' + FeatureName, fontsize=12)
    #plt.show()
    return 

bagging_scatterX2('RE/TA', 1, None) #most significant feature
In [42]:
bagging_scatterX2('WC/TA', 0, None) #insignificant feature in logit
In [43]:
bagging_scatterX2('ME/TL', 3, None) #least significant feature

Part III. Decision Trees and Boosting (AdaBoost)

Decision Tree algorithm minimises entropy or 'seeks purity': the algorithm searches for a split that will lead to clusters that are as pure as possible, ie, with one very dominant class. The search is guided by the defined loss function, which is typically the Gini impurity index.

All of these loss functions can be expressed in terms of the class proportions in a node: if there are J classes, we denote these proportions with $p_j$.

  • the Gini impurity index: $ 1-\sum_{j=1}^Jp_j^2; $

  • the misclassification error: $ 1-\underset{j}{\text{max}}\, p_j; $

  • entropy: $ -\sum_{j=1}^J p_j\log(p_j). $
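As a quick numerical check, the three criteria can be computed directly from class proportions. A minimal sketch (not part of the original notebook):

```python
import numpy as np

def gini(p):
    """Gini impurity: 1 - sum of squared class proportions."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def misclassification(p):
    """Misclassification error: 1 - proportion of the dominant class."""
    return 1.0 - np.max(np.asarray(p, dtype=float))

def entropy(p):
    """Entropy: -sum p_j log(p_j), with 0*log(0) taken as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# A pure node has zero impurity under all three measures;
# a maximally mixed binary node gives Gini = 0.5, misclassification = 0.5, entropy = log 2
print(gini([0.5, 0.5]), misclassification([0.5, 0.5]), entropy([0.5, 0.5]))
```

All three vanish on a pure node and peak at the uniform distribution, which is why they are interchangeable in practice (Gini and entropy also being differentiable).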

Default Probability with Decision Trees

Initially we will run decision trees on the dataset of TWO FEATURES only, which we have learned to be the most explanatory. These features coincide with our Restricted model M0 from the Corporate PD.xls explicit step-by-step implementation of the logistic classifier in Excel.

Why two features only? You will quickly understand how an unbounded, unpruned Decision Tree Classifier branches out into a very elaborate tree, which classifies nearly every observation with a separate, dedicated leaf! In fact, decision trees adapt to the sample: not many branches are required to zero in on leaves of 1-2 observations (MinNumInLeaf).

In this demonstration, we will make the structure of decision tree explicit. But the extraction of individual trees, visualisation and synthesis across Random Forest are out of reach.

Classifier vs Regressor

The relative advantage of a regressor, in the context of bagging and boosting for a binary variable, is that it produces in-between values in (0, 1). That allows a loose interpretation as a probability.

The inherent task is classification because the dependent variable is categorical (here binary). Therefore, Decision Tree Classifiers will naturally isolate class labels more efficiently and report high accuracy scores (> 0.95). As with all classifiers, the confusion matrix should be examined to see within-class prediction. This is more challenging to do with regression output.
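The regressor-as-probability reading has a concrete basis: on a 0/1 target, a tree regressor's prediction for a point is the mean of the target in its leaf, ie, the in-leaf default proportion. A minimal sketch on synthetic data (not the Corporate PD set):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic binary target: "defaults" are rare below x = 0.5, more common above
rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(200, 1))
y = (rng.uniform(size=200) < np.where(X[:, 0] > 0.5, 0.4, 0.02)).astype(int)

reg = DecisionTreeRegressor(max_depth=1, random_state=0)  # one split, two leaves
reg.fit(X, y)
pred = reg.predict(X)

# Each prediction is a leaf mean of 0/1 labels, so it lies in [0, 1]
# and reads loosely as a default likelihood
print(sorted(set(np.round(pred, 4))))
```

With `max_depth=1` there are at most two distinct predicted values, one per leaf, each equal to the observed default rate within that leaf.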

In [27]:
from sklearn.tree import DecisionTreeClassifier

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor


import graphviz
from sklearn.tree import export_graphviz 
from IPython.display import Image
In [180]:
# REPEAT OF Data Loading


AltmanScores_names = [['WC/TA', 'Working Capital/TA'], ['RE/TA', 'Retained Earnings/TA'], ['EBIT/TA', 'EBIT/TA'], ['ME/TL', 'Market Cap/Total Liabilities'], ['S/TA', 'Sales/TA'], ]
#also ['Const','Intercept'] # scores_altman[0][0]

X_Features = pd.read_excel(xl, sheet_name="Logit", usecols=[4,5,6,7,8])
X_Features.head()


Y_Response = pd.read_excel(xl, sheet_name='Logit', usecols=[2])
Y_Response = Y_Response['Default'].ravel()
In [9]:
X_Features_2D = X_Features.loc[:,['RE/TA', 'ME/TL']] 
In [ ]:
rng = np.random.RandomState(1)
In [37]:
regr_1 = DecisionTreeRegressor(max_depth=4)
regr_1.fit(X_Features_2D, Y_Response)
Out[37]:
DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=4,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=None, splitter='best')
In [45]:
regr_2 = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4),
                          n_estimators=300, random_state=rng)
regr_2.fit(X_Features_2D, Y_Response)
Out[45]:
AdaBoostRegressor(base_estimator=DecisionTreeRegressor(ccp_alpha=0.0,
                                                       criterion='mse',
                                                       max_depth=4,
                                                       max_features=None,
                                                       max_leaf_nodes=None,
                                                       min_impurity_decrease=0.0,
                                                       min_impurity_split=None,
                                                       min_samples_leaf=1,
                                                       min_samples_split=2,
                                                       min_weight_fraction_leaf=0.0,
                                                       presort='deprecated',
                                                       random_state=None,
                                                       splitter='best'),
                  learning_rate=1.0, loss='linear', n_estimators=300,
                  random_state=RandomState(MT19937) at 0x7FD67B1E47C0)
In [48]:
# PREDICTION
y_1 = regr_1.predict(X_Features_2D)
y_2 = regr_2.predict(X_Features_2D)
In [54]:
y_1[1:100]
Out[54]:
array([0.        , 0.        , 0.00879567, 0.        , 0.        ,
       0.07920792, 0.234375  , 0.00879567, 0.00879567, 0.00879567,
       0.        , 0.        , 0.07920792, 0.07920792, 0.07920792,
       0.07920792, 0.00879567, 0.00879567, 0.00879567, 0.00879567,
       0.00879567, 0.00879567, 0.00879567, 0.00879567, 0.00879567,
       0.00879567, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.00879567, 0.00879567, 0.00879567,
       0.00879567, 0.00879567, 0.00879567, 0.00879567, 0.00879567,
       0.00879567, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.00879567, 0.00879567, 0.00879567, 0.00879567,
       0.00879567, 0.00879567, 0.07920792, 0.00879567, 0.00879567,
       0.00879567, 0.00879567, 0.00879567, 0.01694915, 0.00879567,
       0.00879567, 0.00879567, 0.00879567, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.00879567, 0.00879567, 0.00879567,
       0.00879567, 0.00879567, 0.00879567, 0.00879567, 0.00879567,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.00879567, 0.00879567, 0.00879567])
In [55]:
y_2[1:100]
Out[55]:
array([0.        , 0.        , 0.44827586, 0.        , 0.        ,
       0.48558758, 0.49567244, 0.45631068, 0.48325893, 0.2907916 ,
       0.        , 0.00314196, 0.4702381 , 0.47368421, 0.48325893,
       0.47789116, 0.00606061, 0.41442716, 0.41442716, 0.00314196,
       0.00314196, 0.00314196, 0.00314196, 0.00314196, 0.00314196,
       0.00314196, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.41442716, 0.00606061, 0.40113985,
       0.40113985, 0.40113985, 0.41442716, 0.2907916 , 0.41442716,
       0.43698061, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.00314196, 0.43698061, 0.41902314, 0.4702381 ,
       0.43698061, 0.46319703, 0.48662664, 0.48998946, 0.32692308,
       0.00314196, 0.4702381 , 0.41698113, 0.47789116, 0.46857143,
       0.46319703, 0.45179856, 0.43698061, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.00314196, 0.32692308, 0.48076375,
       0.47490706, 0.48331999, 0.41902314, 0.41902314, 0.23342939,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.00314196, 0.00314196, 0.45179856])
  • y_1 stores predictions from Decision Tree Regressor (max_depth=4)
  • y_2 stores predictions from AdaBoost Regressor (boosted Decision Trees Regressor)
  • y_3 stores predictions from Decision Tree Classifier, no maximum depth. Therefore, that elaborate tree fits all in-sample observations perfectly.

Its in-sample prediction accuracy will be near 100%. When the in-sample (training) error is zero, we cannot compute any estimate of the extra-sample error (think “0.632 Estimator”) -- exactly when the model is surely overfitted and the extra-sample error is going to be at its largest.

  • y_4 stores predictions from Boosted Decision Tree Classifier
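For reference, the "0.632 Estimator" blends the (optimistic) training error with the (pessimistic) out-of-bag bootstrap error: $\widehat{Err}^{(.632)} = 0.368\,\overline{err} + 0.632\,\widehat{Err}^{(1)}$. A simplified sketch on synthetic data, averaging OOB error per resample rather than per observation:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
err_train = 1.0 - clf.score(X, y)  # apparent (in-sample) error

# Out-of-bag bootstrap error: test each refit on the observations left out
n, B = len(y), 50
oob_errors = []
for _ in range(B):
    idx = rng.randint(0, n, n)              # bootstrap resample with replacement
    oob = np.setdiff1d(np.arange(n), idx)   # out-of-bag observations (~36.8%)
    b_clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[idx], y[idx])
    oob_errors.append(1.0 - b_clf.score(X[oob], y[oob]))
err_oob = float(np.mean(oob_errors))

err_632 = 0.368 * err_train + 0.632 * err_oob
print(f"train error {err_train:.3f}, OOB error {err_oob:.3f}, 0.632 estimate {err_632:.3f}")
```

The point made in the text shows up directly: when `err_train` is exactly zero, the 0.632 estimate collapses onto a rescaled OOB error and the blend tells us nothing extra about the overfit.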
In [175]:
dot_data = export_graphviz(regr_1, out_file=None, 
                           feature_names= ['RE/TA', 'ME/TL'], 
                           class_names=['survival', 'default'], filled=True,
                           rounded=True,
                           proportion=True)

graph = graphviz.Source(dot_data, filename='../CODE_ML_DT/image/TreeREGRESSOR', format='png')
graph.render(); #saved graph as an image in target directory
In [176]:
Image('../CODE_ML_DT/image/TreeREGRESSOR_COPY.png')
Out[176]:

GraphViz package (2022 Guidance)

Problem 1. For MacOS, simple install with conda should resolve the issues, below are Terminal commands. Conda installs seem to be in addition to pip install (and installed into separate directory /python 3.7/site-packages). Conda environment location: /Users/YOURUSERNAME/opt/anaconda3

conda install graphviz

conda install -c conda-forge pydotplus

Graphviz Guidance

RuntimeError/Graphviz executables - reinstall

RuntimeError/Graphviz executables - edit PATH

Problem 2. It is only possible to visualise one tree at a time (not an ensemble of trees / Random Forest). To make things more difficult, there is no easy way to graph the "best" tree. Though the sample code below gives an idea of how to pick one tree from the forest.

from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
import pydot

forest_clf = RandomForestClassifier()
forest_clf.fit(X_train, y_train)
tree.export_graphviz(forest_clf.estimators_[0], out_file='tree_from_forest.dot')
(graph,) = pydot.graph_from_dot_file('tree_from_forest.dot')
graph.write_png('tree_from_forest.png')

sklearn.tree.export_graphviz(decision_tree, ...) exports one decision tree. If you attempt to pass a fitted ensemble (RandomForestClassifier, AdaBoostRegressor, AdaBoostClassifier) into this function, the export for graphviz visualisation will fail.

NotFittedError



Below are experiments with Decision Tree Classifiers using the Gini impurity index and Entropy as loss functions. Observe that the Gini tree closely matches the Decision Tree Regressor's, which uses MSE as its loss function.

In [71]:
clf = DecisionTreeClassifier(random_state=rng, max_depth=4) #bounded classifier - decision tree pruned
clf.fit(X_Features_2D, Y_Response)
Out[71]:
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=4, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=RandomState(MT19937) at 0x7FD67B1E47C0,
                       splitter='best')
In [75]:
print(f'Tree Regressor Accuracy: {regr_1.score(X_Features_2D, Y_Response):0.4}') # NB: regressor .score() returns R^2
print(f'Tree Classifier Accuracy (Gini impurity): {clf.score(X_Features_2D, Y_Response):0.4}')
Tree Regressor Accuracy: 0.3908
Tree Classifier Accuracy (Gini impurity): 0.9878
In [40]:
dot_data = export_graphviz(clf, out_file=None, 
                           feature_names= ['RE/TA', 'ME/TL'], 
                           class_names=['survival', 'default'], filled=True,
                           rounded=True,
                           proportion=True)

graph = graphviz.Source(dot_data, filename='../CODE_ML_DT/image/TreeCLASSIFIER_GINI', format='png')
graph.render();
In [41]:
Image('../CODE_ML_DT/image/TreeCLASSIFIER_GINI_COPY.png')
Out[41]:
In [76]:
clf = DecisionTreeClassifier(random_state=rng, max_depth=4, criterion='entropy') #bounded classifier - decision tree pruned
clf.fit(X_Features_2D, Y_Response)
Out[76]:
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
                       max_depth=4, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=RandomState(MT19937) at 0x7FD67B1E47C0,
                       splitter='best')
In [77]:
print(f'Tree Regressor Accuracy: {regr_1.score(X_Features_2D, Y_Response):0.4}')
print(f'Tree Classifier Accuracy (Entropy): {clf.score(X_Features_2D, Y_Response):0.4}')
Tree Regressor Accuracy: 0.3908
Tree Classifier Accuracy (Entropy): 0.987
In [43]:
dot_data = export_graphviz(clf, out_file=None, 
                           feature_names= ['RE/TA', 'ME/TL'], 
                           class_names=['survival', 'default'], filled=True,
                           rounded=True,
                           proportion=True)

graph = graphviz.Source(dot_data, filename='../CODE_ML_DT/image/TreeCLASSIFIER_ENTROPY', format='png')
graph.render();
In [44]:
Image('../CODE_ML_DT/image/TreeCLASSIFIER_ENTROPY_COPY.png')
Out[44]:
In [79]:
clf = DecisionTreeClassifier(random_state=rng) #unlimited max_depth classifier
clf.fit(X_Features_2D, Y_Response)
Out[79]:
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=RandomState(MT19937) at 0x7FD67B1E47C0,
                       splitter='best')
In [80]:
dot_data = export_graphviz(clf, out_file=None, 
                           feature_names= ['RE/TA', 'ME/TL'], 
                           class_names=['survival', 'default'], filled=True,
                           rounded=True,
                           proportion=True)

graph = graphviz.Source(dot_data, filename='../CODE_ML_DT/image/TreeCLASSIFIER_NOPRUNING', format='png')
graph.render();
In [81]:
Image('../CODE_ML_DT/image/TreeCLASSIFIER_NOPRUNING_COPY.png')
Out[81]:
In [82]:
# Predict
y_3 = clf.predict(X_Features_2D) # below are predictions from the unpruned Classifier with zero in-sample error

y_3[1:100]
Out[82]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
In [83]:
y_3[3950:4000] # about 37 default cases are at the end of the dataset of 4,000 observations
Out[83]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1])

WHOA!

Unbound, unpruned Decision Tree Classifier returned predictions that coincide exactly with survival 0 / default 1 in the original data.

That was not very difficult to achieve for the decision tree method on this Corporate_PD dataset. After all, it is very imbalanced: among 4,000 annual observations (830 companies, 1999-2004), only 72 default instances (observations) occurred within this five-year period.

This is overfitting at its finest, and it is easy to overlook. With zero in-sample (training) error, we cannot even begin to project how badly this tree will behave out-of-sample. In this context we skip the Train/Test split exercise (no need) because we go directly for Boosted Decision Trees, which will be estimated using different subsamples (up to 300).

We will hit another issue with our chosen method to boost trees (AdaBoost), and therefore recommend XGBClassifier with properly scaled data as the next step of improvement (for CQF delegates -- see the relevant Python Lab on eXtreme Gradient Boosting).

In [177]:
print(f'Tree Regressor Accuracy (Untuned) : {regr_1.score(X_Features_2D, Y_Response):0.4}')
print(f'Tree Regressor Accuracy AdaBoost: {regr_2.score(X_Features_2D, Y_Response):0.4}')
print(f'Tree Classifier Accuracy (Unpruned): {clf.score(X_Features_2D, Y_Response):0.4}')
Tree Regressor Accuracy (Untuned) : 0.3908
Tree Regressor Accuracy AdaBoost: -0.475
Tree Classifier Accuracy (Unpruned): 1.0
In [187]:
# Confusion Matrix for Unpruned Decision Tree Classifier (100% in-sample accuracy)
cm = confusion_matrix(Y_Response, y_3)
temp = pd.DataFrame(cm, index=['0', '1'], columns=['0', '1'])

sns.heatmap(temp, annot=True, cmap='Blues', fmt='g')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

Hyperparameter Tuning

For any classifier (Decision Tree, Logistic, SVM, etc), the common and automated search for the best hyperparameters is done using GridSearchCV or RandomizedSearchCV. Both implement “fit” and “score” methods and rely on cross-validated search over the parameter space.

The RandomizedSearchCV method is more appropriate when running complicated pipelines of boosted decision trees (eg XGBoost) because a fixed number of parameter settings is sampled from the specified distributions. Not all parameter values are tried, as with GridSearchCV, and that saves time.

However, for our end purpose of plotting surfaces that relate the key hyperparameters of a decision tree (MinNum2Split, MinNumInLeaf, MaxDepth), it was necessary to code a manual iteration and save the model-space results.
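For completeness, a minimal RandomizedSearchCV sketch over the same three hyperparameters (sklearn names: min_samples_split, min_samples_leaf, max_depth); the data here is synthetic, not the Corporate PD set:

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(1)
X = rng.normal(size=(400, 2))
y = (X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=400) > 0).astype(int)

param_distributions = {
    'min_samples_split': list(range(2, 11)),   # MinNum2Split
    'min_samples_leaf': list(range(1, 11)),    # MinNumInLeaf
    'max_depth': list(range(1, 11)),           # MaxDepth
}

# Only n_iter parameter settings are sampled -- unlike GridSearchCV,
# which would try all 9 * 10 * 10 = 900 combinations
search = RandomizedSearchCV(DecisionTreeClassifier(random_state=0),
                            param_distributions, n_iter=20, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_, round(float(search.best_score_), 3))
```

After fitting, `search.best_params_` and `search.cv_results_` give the same kind of information as the manual model-space table below, just without the full surface.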

In [ ]:
from pylab import plt
plt.style.use('seaborn')

import matplotlib.pyplot as pyplot
from mpl_toolkits import mplot3d 
In [170]:
### OMIT CODE FROM RELEASE ###
In [171]:
### OMIT CODE FROM RELEASE ###

3D plots exploring the hyperparameter search are presented below.

First, results have been saved for the Decision Tree Classifier -- that can be seen from the relatively high accuracy range (0.93, 0.955). MaxDepth was allowed to vary from 1 to 10 as well, but 4 remained the optimal choice.

Tuned hyperparameters for a classifier

In [172]:
accuracy_space = checkHyperparamsDT(X_Features_2D, Y_Response)
accuracy_space.loc[accuracy_space['TestAccuracy'].idxmax(),:].to_frame().T
Out[172]:
    MinNum2Split  MinNumInLeaf  MaxDepth  TrainAccuracy  TestAccuracy
21           2.0           3.0       4.0        0.99375          0.95
In [173]:
fig = pyplot.figure()
ax = fig.add_subplot(projection='3d')

clf_surf1=ax.plot_trisurf(accuracy_space['MinNum2Split'], accuracy_space['MinNumInLeaf'], accuracy_space['TestAccuracy'], cmap=plt.cm.viridis, linewidth=0.2)
fig.colorbar(clf_surf1, shrink=0.5, aspect=5)

ax.set_zlim(0.93, 0.955) #MANUAL ADJUSTMENT -- tight values since Classifier predicts too well
plt.xlabel('Min to Split')
plt.ylabel('Min in Leaf')
plt.savefig('image/clf_surf1.png')
plt.show()
In [174]:
fig = pyplot.figure()
ax = fig.add_subplot(projection='3d')

clf_surf2 =ax.plot_trisurf(accuracy_space['MaxDepth'], accuracy_space['MinNumInLeaf'], accuracy_space['TestAccuracy'], cmap=plt.cm.viridis, linewidth=0.2)

fig.colorbar(clf_surf2, shrink=0.5, aspect=5)
ax.set_zlim(0.93, 0.955) #MANUAL ADJUSTMENT -- tight values since Classifier predicts too well
plt.xlabel('Min in Leaf')
plt.ylabel('Maximum Depth')
plt.savefig('image/clf_surf2.png')
plt.show()

Second, the code above was altered to run the Decision Tree Regressor (the Classifier line commented out), and the results below are saved for the Decision Tree Regressor.

MaxDepth was allowed to range from 1 to 10 -- even though that leads to unmanageable trees and does not improve accuracy towards 100% correct in-sample prediction. The viable accuracy (R^2 score) range on test data for a tree regressor is (0, 0.20).

Tuned hyperparameters for a regressor

In [166]:
accuracy_space = checkHyperparamsDT(X_Features_2D, Y_Response)
accuracy_space.loc[accuracy_space['TestAccuracy'].idxmax(),:].to_frame().T
Out[166]:
     MinNum2Split  MinNumInLeaf  MaxDepth  TrainAccuracy  TestAccuracy
629           9.0           7.0       9.0       0.260401      0.133929
In [167]:
fig = pyplot.figure()
ax = fig.add_subplot(projection='3d')

reg_surf1 = ax.plot_trisurf(accuracy_space['MinNum2Split'], accuracy_space['MinNumInLeaf'], accuracy_space['TestAccuracy'], cmap=plt.cm.viridis, linewidth=0.2)
fig.colorbar(reg_surf1, shrink=0.5, aspect=5)

ax.set_zlim(0, 0.20) #MANUAL ADJUSTMENT -- viable accuracy range for the tree regressor
plt.xlabel('Min to Split')
plt.ylabel('Min in Leaf')
plt.savefig('image/reg_surf1.png')
plt.show()
In [169]:
fig = pyplot.figure()
ax = fig.add_subplot(projection='3d')

reg_surf2 = ax.plot_trisurf(accuracy_space['MaxDepth'], accuracy_space['MinNumInLeaf'], accuracy_space['TestAccuracy'], cmap=plt.cm.viridis, linewidth=0.2)

fig.colorbar(reg_surf2, shrink=0.5, aspect=5)
ax.set_zlim(0, 0.20) #MANUAL ADJUSTMENT -- viable accuracy range for the tree regressor
plt.xlabel('Min in Leaf')
plt.ylabel('Maximum Depth')
plt.savefig('image/reg_surf2.png')
plt.show()
In [ ]:
 

Onto AdaBoost


AdaBoost.M1 was the first successful boosting algorithm developed for binary classification. AdaBoost is a good starting point to study the simple scheme of boosting and the related maths. For a quick introduction, CQF delegates can refer to the Ensemble Methods lecture and Section 3 of DT Theory and Stock Selection - T Guida.ipynb.

The large 'however' is the strategy of AdaBoost-type algorithms to progressively focus on the observations that yield the largest prediction errors. Focusing on better prediction of the outliers might turn out not to be the best strategy.


Boosting is an ensemble technique that attempts to create a strong classifier from a number of weak classifiers.

  • First, a model is fitted on the training data, then

  • Second and subsequent models attempt to correct the errors of the first (prior) model.
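The AdaBoost.M1 reweighting scheme behind those two steps can be sketched in a few lines of numpy. This is a toy illustration with threshold stumps on one feature (not sklearn's implementation); labels are in {-1, +1}:

```python
import numpy as np

def stump_predict(x, thr, s):
    """Threshold stump: predict s for x > thr, -s otherwise (s in {-1, +1})."""
    return np.where(x > thr, s, -s)

def adaboost_m1(x, y, n_rounds=5):
    """Toy AdaBoost.M1 on a 1-D feature with labels y in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # uniform sample weights to start
    model = []                                    # (threshold, sign, alpha) triples
    for _ in range(n_rounds):
        # Weak learner: the stump with the lowest *weighted* error
        thr, s = min(((t, s) for t in x for s in (1, -1)),
                     key=lambda ts: np.sum(w * (stump_predict(x, *ts) != y)))
        pred = stump_predict(x, thr, s)
        err = max(np.sum(w * (pred != y)), 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)     # vote weight of this round's stump
        w = w * np.exp(alpha * (pred != y))       # up-weight the misclassified points
        w = w / w.sum()                           # renormalise for the next round
        model.append((thr, s, alpha))
    return model

def adaboost_predict(model, x):
    agg = sum(a * stump_predict(x, thr, s) for thr, s, a in model)
    return np.sign(agg)                           # weighted majority vote

x = np.array([0.1, 0.2, 0.3, 0.6, 0.7, 0.9])
y = np.array([-1, -1, -1, 1, 1, 1])
model = adaboost_m1(x, y)
print(adaboost_predict(model, x))
```

The exponential reweighting line is exactly the 'focus on the largest errors' strategy discussed above: points a stump gets wrong carry more weight in the next round's weighted error.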

In [231]:
plt.clf()  # to clear the canvas
plt.figure(figsize=(12, 10))
plt.scatter(X_Features_2D['RE/TA'], Y_Response, c="k", label="training samples")

plt.scatter(X_Features_2D['RE/TA'], y_1, c="g", label="n_estimators=1", linewidth=2)
plt.scatter(X_Features_2D['RE/TA'], y_2, c="r", label="n_estimators=300", linewidth=2)
plt.xlabel("for feature: Retained Earnings/Total Assets", fontsize=14)
plt.ylabel("target", fontsize=14)
plt.title("TWO MODELS (Decision Tree - green, Boosted - red): Predictions for Corporate Default", fontsize=14) #plt.colorbar() ADD ON NEXT RUN
plt.legend()
plt.show()

#xlabel= 'wrt Feature: ' + AltmanScores_names[FeatureBetaIdx][1]
#plt.xlabel(xlabel)

NOTE Two models are mixed on scatterplots such as this. In fact there are three datasets: ONE ACTUAL and TWO PREDICTED from our models:

regr_1 = DecisionTreeRegressor(max_depth=4)

regr_2 = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4),
                          n_estimators=300, random_state=rng)

The scattered output reveals plenty of search in our binary classification of default/non-default in the Corporate PD dataset.

The nature of the AdaBoost ensemble is to create many copies of the tree, each fitted to a reweighted copy of the dataset, and we see the kind of search produced by such a forest.
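Incidentally, sklearn's boosted ensembles expose staged_predict, which replays the prediction after each boosting round -- a handy way to watch the ensemble being assembled. A sketch on synthetic data (not the Corporate PD set):

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(1)
X = rng.uniform(-1, 1, size=(200, 1))
y = (X[:, 0] > 0.2).astype(float) + 0.1 * rng.normal(size=200)  # noisy 0/1 target

boosted = AdaBoostRegressor(DecisionTreeRegressor(max_depth=2),
                            n_estimators=50, random_state=0)
boosted.fit(X, y)

# In-sample MSE after each boosting round: one entry per tree added
mse_by_stage = [float(np.mean((stage - y) ** 2))
                for stage in boosted.staged_predict(X)]
print(len(mse_by_stage), round(mse_by_stage[0], 4), round(mse_by_stage[-1], 4))
```

Plotting `mse_by_stage` against the round number shows how much each additional reweighted tree actually contributes.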

In [240]:
plt.clf()  # to clear the canvas
fig = plt.figure(figsize=(18,10))

plt.subplot(221)
plt.scatter(X_Features_2D['RE/TA'], Y_Response, c="k", label="training samples(data)")
plt.scatter(X_Features_2D['RE/TA'], y_1, c="g", label="n_estimators=1", linewidth=2)
#plt.ylabel('Prediction target', fontsize=12) 
plt.xlabel('Retained Earnings/Total Assets', fontsize=14)
plt.title('Decision Tree Regression', fontsize=14)

plt.subplot(222)
plt.scatter(X_Features_2D['RE/TA'], Y_Response, c="k", label="training samples(data)")
plt.scatter(X_Features_2D['RE/TA'], y_2, c="r", label="n_estimators=300", linewidth=2)
#plt.ylabel('Prediction target', fontsize=12)
plt.xlabel('Retained Earnings/Total Assets', fontsize=14)
plt.title('BOOSTED Decision Tree Regression', fontsize=14)

plt.subplot(212)
plt.scatter(X_Features_2D['RE/TA'], Y_Response, c="k", label="training samples(data)")
plt.scatter(X_Features_2D['RE/TA'], y_3, c="r", label="n_estimators=300", linewidth=2)
#plt.ylabel('Prediction target', fontsize=12)
plt.xlabel('Retained Earnings/Total Assets', fontsize=14)
plt.title('Decision Tree CLASSIFIER', fontsize=14)


plt.show()

Retained Earnings/Total Assets feature.

AdaBoostRegressor clearly created randomised estimators (and struggled to predict Defaults?)

Decision Tree Classifier has such an elaborate fitted decision tree (above) that it predicts every in-sample point pretty much exactly.


In [218]:
plt.clf()  # to clear the canvas
fig = plt.figure(figsize=(18,10))

    
plt.subplot(221)
plt.scatter(X_Features_2D['ME/TL'], Y_Response, c="k", label="training samples (data)")
plt.scatter(X_Features_2D['ME/TL'], y_1, c="g", label="n_estimators=1", linewidth=2)
#plt.ylabel('Prediction target', fontsize=12) 
plt.xlabel('Market Equity (Cap)/Total Liabilities', fontsize=14)
plt.title('Decision Tree Regression', fontsize=14)

plt.subplot(222)
plt.scatter(X_Features_2D['ME/TL'], Y_Response, c="k", label="training samples (data)")
plt.scatter(X_Features_2D['ME/TL'], y_2, c="r", label="n_estimators=300", linewidth=2)
#plt.ylabel('Prediction target', fontsize=12)
plt.xlabel('Market Equity (Cap)/Total Liabilities', fontsize=14)
plt.title('BOOSTED Decision Tree Regression', fontsize=14)

plt.show()

Market Equity (Cap)/Total Liabilities feature.


In [195]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier


X = X_Features_2D #X_Features.loc[:,['RE/TA']] 
y = Y_Response

# Create and fit an AdaBoosted decision tree
bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         algorithm="SAMME",
                         n_estimators=200)

bdt.fit(X, y) #bdt for Boosted Decision Tree
Out[195]:
AdaBoostClassifier(algorithm='SAMME',
          base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=1,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
          learning_rate=1.0, n_estimators=200, random_state=None)
In [196]:
bdt.predict_proba(X)
Out[196]:
array([[0.56632825, 0.43367175],
       [0.56632825, 0.43367175],
       [0.56018259, 0.43981741],
       ...,
       [0.52860768, 0.47139232],
       [0.48728029, 0.51271971],
       [0.49114406, 0.50885594]])
In [ ]:
# EXPERIMENTAL -- X is a DataFrame, so use a numpy view for positional indexing
plot_colors = "br"
plot_step = 0.02
class_names = "SD"

X_arr = X.values  # numpy array of the two features

# Plot the decision boundaries
plt.figure(figsize=(20, 10))
x_min, x_max = X_arr[:, 0].min() - 1, X_arr[:, 0].max() + 1
y_min, y_max = X_arr[:, 1].min() - 1, X_arr[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                     np.arange(y_min, y_max, plot_step))

Z = bdt.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
plt.axis("tight")

# Plot the training points
for i, n, c in zip(range(2), class_names, plot_colors):
    idx = np.where(y == i)[0]
    plt.scatter(X_arr[idx, 0], X_arr[idx, 1],
                c=c, cmap=plt.cm.Paired,
                s=20, edgecolor='k',
                label="Class %s" % n)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.legend(loc='upper right', fontsize=14)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Decision Boundary', fontsize=14)
In [209]:
# Plot the two-class decision scores
class_names = "SD"
plt.figure(figsize=(15, 10))
twoclass_output = bdt.decision_function(X)
plot_range = (twoclass_output.min(), twoclass_output.max())
for i, n, c in zip(range(2), class_names, plot_colors):
    plt.hist(twoclass_output[y == i],
             bins=10,
             range=plot_range,
             facecolor=c,
             label='Class %s' % n,
             alpha=.5,
             edgecolor='k')
x1, x2, y1, y2 = plt.axis()
plt.axis((x1, x2, y1, y2 * 1.2))
plt.legend(loc='upper right', fontsize=14)
plt.ylabel('Samples', fontsize=14)
plt.xlabel('Score', fontsize=14)
plt.title('Decision Scores', fontsize=14)

plt.tight_layout()
plt.subplots_adjust(wspace=0.35)
plt.show()

Decision Score

  • The predicted class label for each sample is determined by the sign of the decision score. Samples with decision scores greater than zero are classified as DEFAULT, and are otherwise classified as SURVIVAL. The magnitude of a decision score determines the degree of confidence in the predicted class label.

  • The Decision Tree Classifier limits itself to the set of class labels. Decision Tree Regression, by contrast, provides a continuous value for the dependent variable.

In [203]:
y_4 = bdt.predict(X)

plt.clf()  # to clear the canvas
plt.figure(figsize=(12, 10))
plt.scatter(X_Features_2D['RE/TA'], Y_Response, c="k", label="training samples")

plt.scatter(X_Features_2D['RE/TA'], y_4, c="r", label="n_estimators=200", linewidth=2)
plt.xlabel("Feature: Retained Earnings/Total Assets", fontsize=14)
plt.ylabel("target", fontsize=14)
plt.title("BOOSTED Decision Tree Classifier: Predictions for Corporate Default", fontsize=14) #plt.colorbar() ADD ON NEXT RUN
plt.legend()
plt.show()

Corporate Default Prediction (repeat)


TAKE AWAYS FROM TESTING OF ALL CLASSIFIERS

  • The most significant feature is Retained Earnings / Total Assets (RE/TA), while in the dataset there were plenty of negative retained earnings (annual loss) the default signal given by the fitted sigmod is too late: negative Retained Earnings (loss) has to be very large, then the model classifies an obvious case of default.

  • Choosen for its negative feedback, Market Equity (Cap)/Total Liabilities (ME/TL) feature is an example of exogenous variable and large cap is associated with survival.

  • Confusion matrices help to look into prediction accuracy within lass. As soon as we separate the dataset into train/test halves, Logistic Regression MISCLASSIFIED MOST defaults (eg 69 of 72). We also noticed that our Corporate PD.xls data is imbalanced: only 2% of companies defaulted.

  • Crossvalidation proves useless: regardless how we draw the sample, the mistake in predicting defaults is just not affecting the total accruacy score, which remains about 98%.

  • To tackle the issue, we created machine learning 'pipeline' where we fed Logistic Classifier object into Bagging Regressor. The method to fit the model (often a decision tree) from multiple sub-samples. Each competing model makes a prediction, and these predictions are averaged. That allowed us to predict up to half defaults correctly.

  • Visualisation of the Decision Tree regressor revealed how it starts building the tree by splitting on the feature ME/TL. Clustering with Radviz 'spring tension' plots confirmed that small cap companies tend to have a higher rate of default.

  • It is important to use the Decision Tree Regressor because the Decision Tree Classifier creates a very elaborate tree that predicts the data points on which it was trained almost perfectly -- a sign of overfitting.

  • The open question remains whether the fitted Decision Tree will be as effective on OUT OF SAMPLE data. Are we any closer to predicting defaults, specifically for the large cap companies that are in trouble?
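The bagging pipeline described in the takeaways can be sketched as follows. This is an illustrative reconstruction, not the lab's exact settings: the data here is synthetic (two stand-in features with roughly a 2-4% rare class), and the `class_weight="balanced"` option is an assumption to help the base learners see the minority class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import BaggingClassifier

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 2))   # stand-ins for features such as RE/TA and ME/TL
y = (X[:, 0] + 0.5 * rng.normal(size=n) < -2.0).astype(int)  # rare 'default' class

# Each of the 200 base learners is fitted on a bootstrap sub-sample;
# the ensemble combines their predictions by majority vote.
bag = BaggingClassifier(
    LogisticRegression(class_weight="balanced"),
    n_estimators=200,
    max_samples=0.5,
    random_state=0,
).fit(X, y)

pred = bag.predict(X)
# Predictions stay within the known class labels {0, 1}
print(sorted(set(pred)))
```

Passing the base estimator positionally keeps the sketch compatible across scikit-learn versions (the keyword was renamed from `base_estimator` to `estimator`).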

END OF DEMO

Resources

scikit-learn User Guide

online resources and textbooks

  • Quick walk-through with simplifying plots: Logistic Regression, LDA, Decision Trees, Naive Bayes, K-Nearest Neighbors, LVQ, SVM, Bagging and Random Forest, Boosting and AdaBoost -- those are our keywords.

paper textbooks

  • Machine Learning: An Applied Mathematics Introduction, our introduction from Paul Wilmott, 2019. Look into the maths of various kinds of 'distances' between observations. Also understand why quite a few ML techniques (KNN, Decision Trees) might not represent any 'learning'.


  • Big Data and Machine Learning in Quantitative Investment by Tony Guida. A collection of chapters -- each shows a different application in finance, with varying degrees of clarity (no code).


  • An Introduction to Statistical Learning with Applications in R by James/Witten/Hastie/Tibshirani PDF DOWNLOAD available from the authors. Book website.


  • The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Hastie/Tibshirani/Friedman PDF DOWNLOAD is a full-blown theory textbook. Sections such as 2.3.3 From Least Squares to Nearest Neighbors can give an insight for your thinking. Book website.


Slidepacks from Data Science and Machine Learning I, II Modules' lectures, and remember you will have access to Electives that also have ML/Time Series/Python content. This ML Lab's notebook has other useful links throughout. In addition, Additional Python.zip (or a similarly named archive on the CQF Portal) has Python notebooks from the ML Elective and two other relevant Workshops (by Dr Yves Hilpisch).


ADVICE While mathematical clarity is helpful -- ML and AI can be very difficult to learn from textbooks.

You might have a perfect outline of a method (eg, Ridge and Lasso regressions) but not realise that none of it is applicable (or makes much difference) for financial time series, such as prices of Market Indices, ETFs, and individual Equities and Commodities.

Equally, you might learn everything proper about Crossvalidation and Bootstrap Samples (Elements of Statistical Learning) but realise that none of it matters in a situation where there is not much default data.
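The scarce-default point can be made concrete with a few lines of pure Python, using the proportions quoted earlier (2% defaults, 72 defaulted firms -- illustrative numbers): a baseline that never predicts a default still scores 98% accuracy, so no amount of resampling of the accuracy score reveals the problem.

```python
# 'Always survive' baseline on an imbalanced sample: 2% default rate
n_total, n_defaults = 3600, 72
y_true = [1] * n_defaults + [0] * (n_total - n_defaults)
y_pred = [0] * n_total  # never predicts a default

accuracy = sum(int(t == p) for t, p in zip(y_true, y_pred)) / n_total
defaults_caught = sum(int(t == 1 and p == 1) for t, p in zip(y_true, y_pred))

print(f"accuracy = {accuracy:.0%}")                            # 98% -- looks excellent
print(f"defaults caught = {defaults_caught} of {n_defaults}")  # 0 of 72
```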



research sources on corporate defaults

Corporate Defaults

Moody's Default & Recovery Database

Includes more than 500,000 individual debt securities, both corporate and sovereign entities, and default history starting from 1920.

Flexibility: Provides universal identifiers, such as CUSIPs and SIC codes; primary keys for complex querying; and classifications for debt type, rating type, and region, as well as flags.

Altman Z-Score+ analysis of Boeing Company (Ticker : BA)

NYU Research Centers

  • The Altman-Kuehne/NYU Salomon Center Bond Master Default Database

The database now has over 3,700 observations on defaulted bond issues covering the period 1971-2016. Data includes SIC, CUSIP, company identification (when available), bond description, date of default, price at default (Recovery), original bond rating, seniority, type of default (bankruptcy, missed interest payment, or distressed exchange) and more.

  • The Altman-Kuehne/NYU Salomon Center Chapter 11 Filings Database

A complete list of corporate bankruptcies under either Chapter X and XII (1971-1978) or Chapter 11 (1979-present) of firms with liabilities equal to or greater than $100 million at time of filing. As of March 2017, the list includes over 2,260 bankruptcies with firm identifiers (when available), bankruptcy date, size of liabilities and SIC.

  • The Altman-Kuehne/NYU Salomon Center Defaulted Bond and Bank Loan Monthly Pricing Databases

Defaulted bond prices are available from 1987 to present (2,078 bonds as of February 2017); defaulted bank loan prices on a monthly basis are available from 1996 through the present (861 loan facilities as of February 2017).


http://www.stern.nyu.edu/experience-stern/about/departments-centers-initiatives/centers-of-research/salomon-center-for-the-study-of-financial-institutions/research-areas/credit-debt-markets/databases

http://pages.stern.nyu.edu/~ealtman/Credit%20&%20Debt%20Markets%20Databases.htm

